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Abstract 

O ■ 

Multi-task learning models using Gaussian processes (GP) have been developed and suc- 
cessfully applied in various applications. The main difficulty with this approach is the 
computational cost of inference using the union of examples from all tasks. Therefore 
sparse solutions, that avoid using the entire data directly and instead use a set of informa- 
tive "representatives" are desirable. The paper investigates this problem for the grouped 
mixed-effect GP model where each individual response is given by a fixed-effect, taken 
I 1 . from one of a set of unknown groups, plus a random individual effect function that cap- 

tures variations among individuals. Such models have been widely used in previous work 
. but no sparse solutions have been developed. The paper presents the first sparse solution 

for such problems, showing how the sparse approximation can be obtained by maximizing 
a variational lower bound on the marginal likelihood, generalizing ideas from single-task 
" Gaussian processes to handle the mixed-effect model as well as grouping. Experiments 

using artificial and real data validate the approach showing that it can recover the perfor- 
mance of inference with the full sample, that it outperforms baseline methods, and that it 
outperforms state of the art sparse solutions for other multi-task GP formulations. 
Keywords: Multi-task Learning, Gaussian Processes, Sparse Model, Mixed-effect Model 
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1. Introduction 



X 



In many real world problems we are interested in learning multiple tasks while the training 
set for each task is quite small. When the different tasks are related one can learn all tasks 
simultaneously and aim to get improved predictive performance by taking advantage of 
the common aspects of all tasks. This general idea is known as multi-task learning and it 
has been successfully investigated in several technical settings, with applications in many 
areas including medical diagnosis (Bi et al., 2008), recommendation systems (Dinuzzo et al., 
2008) and HIV Therapy Screening (Bickel et al., 2008). 

In this paper we explore Bayesian models especially using Gaussian Processes (GP) 
where sharing the prior and its parameters among the tasks can be seen to implement 
multi-task learning (Alvarez et al., 2011; Bonilla et al., 2008; Xue et al., 2007; Gelman, 
2004; Yu et al., 2005; Schwaighofer et al., 2005; Pillonetto et al., 2010). Our focus is on 
the functional mixed-effect model (Lu et al., 2008; Pillonetto et al., 2010) where each task is 
modeled as a sum of a fixed effect shared by all the tasks and a random effect that can be in- 
terpreted as representing task specific deviations. In particular, both effects are realizations 
of zero-mean Gaussian processes. Thus, in this model, tasks share structure through hyper- 
parameters of the prior and through the fixed effect portion. This model has shown success 
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in several applications, including geophysics (Lu et al., 2008), medicine (Pillonetto et al., 
2010) and astrophysics (Wang et al., 2010). One of the main difficulties with this model, 
however, is computational cost, because while the number of samples per task Nj is small, 
the total sample size ^ ■ Nj can be large, and the typical cubic complexity of GP inference 
can be prohibitively large (Yu et al., 2005). Some improvement can be obtained when all 
the input tasks share the same sampling points, or when different tasks share many of the 
input points (Pillonetto et al., 2009, 2010). However, if the number of distinct sampling 
points is large the complexity remains high. For example, this is the case in (Wang et al., 
2010) where sample points are clipped to a fine grid to avoid the high cardinality of the 
example set. 

The same problem, handling large samples, has been addressed in single task formal- 
izations of GP, where several approaches for so-called sparse solutions have been developed 
(Rasmussen and Williams, 2005; Seeger and Lawrence, 2003; Snelson, 2006; Titsias, 2009). 
These methods approximate the GP with m <C N support variables (or inducing variables, 
pseudo inputs) X m and their corresponding function values f m and perform inference using 
this set. 

In this paper, we develop a sparse solution for multi-task learning with GP in the 
context of the functional mixed effect model. Specifically, we extend the approach of Titsias 
(2009) and develop a variational approximation that allows us to efficiently learn the shared 
hyper-parameters and choose the sparse pseudo samples. In addition, we show how the 
variational approximation can be used to perform prediction efficiently once learning has 
been performed. Our approach is particularly useful when individual tasks have a small 
number of samples, different tasks do not share sampling points, and there is a large number 
of tasks. Our experiments, using artificial and real data, validate the approach showing that 
it can recover the performance of inference with the full sample, that it performs better than 
simple sparse approaches for multi-task GP, and that for some applications it significantly 
outperforms alternative sparse multi-task GP formulation (Alvarez and Lawrence, 2011). 

To summarize, our contribution is threefold. First we introduce the first sparse solution 
for the multi-task GP in mixed-effect model. Second, we develop a variational model- 
selection approach for the proposed sparse model. Finally we evaluate the algorithm and 
several baseline approaches for multi-task GP, showing that the proposed method performs 
well. 

This paper is organized as follows. Section 2 reviews the mixed-effect GP model and 
its direct inference. Section 3 develops the variational inference and model selection for the 
sparse mixed-effect GP model. Section 4 shows how to extend the sparse solution to the 
grouped mixed-effect GP model. We discuss related work in Section 5 and demonstrate 
the performance of the proposed approach using three datasets in Section 6. Section 7 
concludes with a summary and directions for future work. 

2. Mixed-effect GP for Multi-task Learning 

In this section and the next one, we develop the mixed-effect model and its sparse solution 
without considering grouping. The model and results are extended to include grouping 
in Section 4. Consider a set of M tasks where the data for the jth task is given by 
V j = {(x{,yl)},i = 1,2,- ■• ,Nj. Multi-task learning aims to learn all tasks simultane- 
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ously, taking the advantage of common aspects of different tasks. In this paper, given data 
y = {V 1 }, we are interested in learning the nonparametric Bayesian mixed-effect model 
and using the model to perform inference. The model captures each task f 3 as a sum 
of an average effect function and an individual variation specific to the jth task. More 
precisely (Pillonetto et al., 2010): 

Assumption 1 For each j and x G X C IR d ; 

f(x) = f(x) + f(x), j = l,-..,M (1) 

where f and {f 3 } are zero-mean Gaussian processes. In addition, f and the set of {f 3 } 
are assumed to be mutually independent with covariance functions K(-, •) and K{-, •) respec- 
tively. 

Assumption 1 implies that for j, I £ {1, • • • , M}, the following holds: 

Cov[f (s), f\t)} = Cov[/», f(t)} + S jt • Cov[/>), /(*)] (2) 

where Sji is the Kronecker delta function. Let x be the concatenation of the examples from 
all tasks x = (xj), and similarly let y = (yj), where i = 1, 2, • • • ,Nj,j = 1, 2, • • • , M and 
= Y2j Nj. It can easily been seen that, for any j € {1, • • • , M} and new input x* for 
task j, we have 

"Ct(x*,x*) Ct(a;*,*) 
C\x,x*) C\x,x) + a 2 l 



M 



p{x*) 

y 

where the covariance matrix is given by 



(3) 



C\(x(), (x{)) = K(xi,x{) + S jt ■ K((xl,x{). 

From (3) we can extract the marginal distribution Pr(y) where 

y\x ~ M(0,C\x,x) + a 2 I), (4) 

which can be used for model selection, that is, learning the hyper-parameters of the GP. 
(3) also provides the predictive distribution where 

E(/V)) = C\x*,x)(C\x,x) + «r 2 I)- 1 tf 
Cov(f j (x*)) = C\x*,x*) - C^x*,x){C\x,x) + a 2 !)- 1 ^^^*). ''^ 

This works well in that sharing the information improves predictive performance but, as 
the number of tasks grows, the dimension N increases leading to slow inference scaling as 
C(A^ 3 ). In other words, even though each task may have a very small sample, the multi-task 
inference problem becomes infeasible when the number of tasks is large. 

In single task GP regression, to reduce the computational cost, several sparse GP ap- 
proaches have been proposed (Rasmussen and Williams, 2005; Seeger and Lawrence, 2003; 
Snelson, 2006; Titsias, 2009). In general, these methods approximate the GP with a small 
number m <C N of support variables and perform inference using this subset and the cor- 
responding function values f m . Different approaches differ in how they choose the support 
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variables and the simplest approach is to choose a random subset of the given data points. 
Recently, Titsias (2009) introduced a sparse method based on variational inference using a 
set X m of inducing samples, which are different from the training points. In this approach, 
the sample points X m are chosen to maximize a variational lower bound on the marginal 
likelihood, therefore providing a clear methodology for the choice of the support set. Follow- 
ing their idea, Alvarez et al. (2010) proposed the variational inference for sparse convolved 
multiple output GPs. 

In this paper we extend this approach to provide a sparse solution for the aforementioned 
model as well as generalizing it to the Grouped mixed-effect GP model (Wang et al., 2010). 
As in the case of sparse methods for single task GP, the key idea is to introduce a small 
set of m auxiliary inducing sample points X m and base the learning and inference on these 
points. For the multi-task case, each /■*(•) is specific to the jth task. Therefore, it makes 
sense to induce values only for the fixed-effect portion f m = f(X m ). The details of this 
construction are developed in the following sections. 



3. Sparse mixed- effect GP Model 

In this section, we develop a sparse solution for the mixed-effect model without group effect. 
The model is simpler to analyze and apply, and it thus provides a good introduction to the 
results developed in the next section for the grouped model. 



3.1. Variational Model Selection 

In this section we specify the sparse model, and show how we can learn the hyper-parameters 
and the inducing variables using the sparse model. As mentioned above, we introduce aux- 
iliary inducing sample points X m and hidden variables f m = f(X m ). Let P = f(x^) € JR Nj 
and P = f(x 3 ) £ JR Nj denote the values of the two functions at x 3 . In addition let C*j = 
C(x*,x 1 ), Cjj = C(x J ,x J ) and C mm = C(X m , X m ), and similarly for C*j, Cjj, C mm . 

To learn the hyper-parameters we wish to maximize the marginal likelihood Pr(j/) where 
y is all the observations. In the following we develop a variational lower bound for this 
quantity. To this end, we need the complete data likelihood and the variational distribution. 

• The complete data likelihood Pr({y J }, {P , P}, f m ) is given by: 



Pr({y*'}|{f^}) Pr({P}) Pr({f'}|f m ) Pr(f„ 
" Al 

YlPr(y j \P,P)Pv(P) 



j'=i 



Pr({f''}|f m )Pr(f„ 



We approximate the posterior Pr({f J , f- 7 }, fmlji/- 7 }) on the hidden variables by 



9({fV J '},fn 



M 



JJPr(P|P',^ 

3=1 



Pr({P}|f m )<Kf n 



(6) 



which extends the variational form used by Titsias (2009) to handle the individual 
variations as well as the multiple tasks. One can see that the variational distribution 
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is not completely in free form. Instead, q(-) preserves the exact form of Pr(f J \P , y 3 ) 
and in using Pr({f J }|f TO ) it implicitly assumes that f m is a sufficient statistic for {P}. 
The free form <p(f m ) corresponds to Pr(f m |y) but allows it to diverge from this value to 
compensate for the assumption that f m is sufficient. Notice that we are not making any 
assumption about the sufficiency of f m in the generative model and the approximation 
is entirely due to the variational distribution. An additional assumption is added later 
to derive a simplified form of the predictive distribution. 

With the two ingredients ready, the variational lower bound (Jordan et al., 1999; 
Bishop, 2006), denoted as Fv(X m ,(j)), is given by: 



Pr(y) >F v {X m ,4>) 

q({P,P},f m ) xlog 



/ 



M 



Pr({yj},{fJ,fJ},f m ) 
g ({fi,fi},f m ) 

Pr({P}|f m )«Kf m ) 



d{P}d{P}dp, 



x log 



JJPr(P|PV 

i=i 

jjPr( ? /'|f',f / )Pr(f i ) Pr(f, 



Pr(f'|f',y') <f>{fm) 
Pr(f m 



= J 0(f m )|logG(f m ,J)+log 
The inner integral denoted as log G(f m , [V) is 



<Kfm) 



d{P}d{P}di n 



I 



Y[PT(F\F,y> 



3=1 
M 



M 



Pr({f'}|f m )xJ>g 



l=i 



Pr(i/|f',f')Pr(f z ) 



Pv(P\t l ,y l ) 



d{P}d{P} 



J Pr(f J '|f J V)Pr(F'|f m ) xlog 
3=1 



Pr{yi\P,P)Pr(P) 



Pr(P\P, y . 



■3' 



(7) 



dP dP 



where the second line holds because in the sum indexed by I all the product measures 

M 

H Pr(P\P,yi)PT({r}n^i\f m A)d{P}d{P}, 

3=1,3+1 

are integrated to 1, leaving only the j-th integral. In subsection 3.1 we show that 

1. 



logG(f m ,^) = ^ 

3=1 



log 



Jsf(v>\oLj,C 33i 



-Tr 



(Cjj - Qjj)[C 33i 



i-i 



(8) 
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where a.j = C, m C m ^f m , Cjj = cr^I + Cjj, and Qjj = CjmC^Cmj. Thus we have 



j <!>$, 

I 



logG(f m ,3^) + log 



Pr(f m ) 

0(fm) 



dfrr 



(j>{tm) log 



n- A'( ? /'|a r C ; /S Pr(f„ 



(9) 



M 

Let i; be a random variable and g any function, then by Jensen's inequality IE [log g(v)] < 
loglE[p(i;)]. Therefore, the best lower bound we can derive from (9), if it is achievable, is 
the case where equality holds in Jensen's inequality. In subsection 3.2 we show that (f)(f m ) 
can be chosen to obtain equality, and therefore, the variational lower bound is 

r i M 

F v (X m , 4>) = log / H [JVVK, C^)] Pr(f ro )df m " 2 E Tr [( C n ~ Qj^n]" 1 

J 3 3=1 

Evaluating the integral by marginalizing out f TO and recalling that y is the concatenation 
of the y 3 , we get 



F v (X m , 9, 0) = log [Af(y\0, A m C^ m Al + C m )\ ~ 

3=1 



1 



-Tr 



[Cjj - Qjj)[Cjj] 1 



(10) 



where 



C2m 



M 



and 



(D C 33 

3=1 



V 



11 



,22 



'MM 



Thus, we have explicitly written the parameters that can be chosen to further optimize the 
lower bound, namely the support inputs X m , and the hyper-parameters and 6 in K and 
K respectively. By calculating derivatives of (10) we can optimize the lower bound using a 
gradient based method. In the experiments in this paper, we use stochastic gradient descent 
(SGD), which works better than the conjugate gradient (CG) in this scenario where the 
number of tasks is large. Titsias (2009) outlines methods that can be used when gradients 
are not useful. 
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3.1.1. Evaluating logG(f m ,y) 

Consider the j-th element in the sum of (7): 



Gj(P,y j ) 



Pr(F|F,^)Pr(f^|f m )log 



Pr(yi\P,P)Pr(P 
Pr(P\P,yi) 



dPdP 



Pr(f*|f^)Pr(F|f m ) 

I 'iff' yi.P) Pr(yi\P) Pt(P) 



X lo; 



dPdP 



Pr(P\P) Pr(P\P,yi) 
Pr(F|f m )log [Pr(i^'IP')] (J Pr(P\P,yi)dfAdP 

Pr(P|f m )log [Pr(yi\P)] dP = E [fi , fm] log [Pr&\P)] 

where the third line holds because of the independence between P and P. We next show 
how this expectation can be evaluated. This is more complex than the single-task case 
because of the coupling of the fixed-effect and the random effect. 
Recall that 



Pr(f^|f m ) — j\T (P \ Cj m C mm { m , Cjj C j m C mrn C rn j t 



and 



y j \P ~ M(P,C 



where Cjj = a z I + Cjj. Denote C,-,- = L T L where L can be chosen as its Cholesky 
decomposition, we have 



jj 



log [Prty |f^)] = --(y - P'fC^V - P) + log 

-(Ly j - LP) T (Ly j - LP) + log 



N J ' 






(27T)-* 


+ log 


'\Cjj\-?_ 



(27T)" 



+ log 



Notice that 

Pr(LP\f m ) = Af(LC jm C^ m f m ,L(Cjj - Qjj)L T ) 

where Qjj = Cj m C^ m C m j. Recall the fact that for x ~ A/"(/x, S) and a constant vector a, 
we have E[[|o - x\\ 2 ] = \\a - n\\ 2 + Tr(S). Thus, 



E [fJ|fm] log [Pr(^ ; \P)} = --\\Lyi - LC jm C^ n 



~Tr{L{Cjj - Qjj)L T ) + log 



(2tt)- 



+ l0g |C#| 5 



+ log 



(27T)- 



+ log 



-TV [L(Cjj - Qjj)I? 



log 



J\f(y J \aj, C J3/ 



2 



(Cjj - Qjj)C 



J J 
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where ocj = Cj m C r J m f m . Finally, calculating ^ - Gj(P , y J ) we get (8). 



3.1.2. Variational distribution 0*(f m ) 

For equality to hold in Jensen's inequality, the function inside the log must be constant. In 
our case this is easily achieved because (j>(f m ) is a free parameter, and we can set 

IL fWlajAj)! Pr(f m 



yielding the bound given in (10). Setting 0(f m ) oc f\j -^(v^^j^jj) P r (fm) yields the 
form of the optimal variational distribution 



f£m) oc 



n[Wl«j,C^)] Pr(f m ) 



x 



from which we observe that d>*({ m ) is 



N\ f„ 



(11) 



where $ = C mm + ^ • C m j[C^-] C jm . Notice that by choosing the number of tasks to be 

1 and the random effect to be a noise process, i.e. K(s,t) = a 2 5(s,t), (10) and (26) are ex- 
actly the variational lower bound and the corresponding variational distribution in (Titsias, 
2009). 



3.2. Prediction using the Variational Solution 

Given any task j, our goal is to calculate the predictive distribution of f J (x*) = f(x*) + 
f J (x*) at some new input point x*. As described before, the full inference is expensive and 
therefore we wish to use the variational approximation for the prediction as well. The key 
assumption is that f m contains as much information as y in terms of making prediction for 
/. To start with, it is easy to see that the predictive distribution is Gaussian and that it 
satisfies 

npi^m =nf(x*)\y}+nf j (x*)\y) (12) 

Var[/^V)|y] = Var [/(**) |#] + Var[P(x*)\y] + 2Cov[/>*)/^ (**)!#]■ 

The above equation is more complex than the predictive distribution for single-task sparse 
GP because of the coupling induced by f(x*)f J (x*)\y. We next show how this can be 
calculated via conditioning. 

To calculate the terms in (12), three parts are needed, i.e., Pi(f(x*)\y), Pv(f(x*)\y) 
and Cov[f(x*)fj(x*)\y]. Using the assumption of the variational form given in (6), we 
have the following facts, 
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1. i m \y ~ 0*(f m ) = A/"(/x, A) where fi and A are given in (11). 

2. f m is sufficient for {f J }, i.e. Pv({P}\i m ,y) = Pr({f J }|f m ). Since we are interested 
in prediction for each task separately, by marginalizing out f l ,l ^ j, we also have 
Pr(P|f m ,y)=Pr(P'|f m ) and 

f J |fm) V ~ N (Cj' m C m J rl f m , Cjj — CjmCj^Cmj) . (13) 

3. For f 3 (x*) we can view y 3 — P as noisy realizations from the same GP as f 3 (x 3 ) and 
therefore 







1 C*j 





-1 



[y 3 -P] ,C**-C 



-i 



(14) 



In order to obtain a sparse form of the predictive distribution we need to make an 
additional assumption. 

Assumption 2 We assume that f m is sufficient for f(x*), i.e., 

Pr(/>*)|f m ,y)=Pr(/>*)|f m ), 

implying that 

f( x )|f"m)J/ ~ A/" (C* m C mm f m , C** — C^ m C mm C m ^) . (15) 

The above set of conditional distributions also imply that f(x*) and f 3 (x*) are independent 
given f m and y. 

To evaluate (12), we have the following 

• We can easily get Pi{f {x*)\y) by marginalizing out f m |y in (15), 

Pr(/>*)|y) = J Pr(/>*)|f m )^(f m )df m 

yielding 

f( x *)\y ~ -A/" I C* m C m J ri //, C** — C* m C mm C TO * + C tm C m J„ AC m J„ C mt ] . (16) 



• Similarly, we can obtain Pr(/(a;*)|y) by first calculating Pr(f J |y) by marginalizing 
out f m \y in (13) and then marginalizing out P\y in (14), as follows. First we have 
P\y ~ N(C jmCj^n, B) where 

= ^33 ~ Cjm,C m mCrnj + Cj m C mm AC mm C m j. 

Next for Pr(f(x*)\y), we have 



Pr(f 3 (x*)\y) = J Pi(f j (x*)\P,y 3 )Pr(P\y)dP 
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and marginalizing out f J , f(x*)\y can be obtained as 

-l 



TV C 



J *,i 



+ c 



*j 



y~* ^jmC mm fj^ , c** c 
-i 



*j 



CjmCmm X BC mm C m j ( Cjj + (Tjlj 



-1 



-1 



• Finally, to calculate Cov[f(x*)f j (x*)\y] we have 

Cov[f(x*)fj(x*)\y]=JE[f(x*)-f(x*)\y] - E[/>*)|$E [/>*)!#] 



where 



JE[f(x*) ■ p(x*)\y] = E^Ef/V) • p{x*)\i m ,y] 



E 



fm|y 



m[f{x*)\i m ] ■E[f( !E *)|f m) ^] 



(17) 



(18) 



where the second line holds because, as observed above, the terms are conditionally 
independent. The first term E [/• ? (cc*)|f TO ] can be obtained directly from (15). By 
marginalizing out f-^|f m in (14) such that 



Vr{P{x*)\f m ,y*) = J PiiP(x*)\V,y)Pr(P\i m )dV, 
we can get the second term. This yields 



Af(c*j 
+ C 



Cjj + a]lj 



(y J — Cy ni c mTO f m ) , c** — c*j Cjj + ojij 



c jj + (J j I j 



C{Cjj + a]lj 



Co* 



(19) 



where C = Cjj — Cj m C rr) L rn C m j. To simplify the notation, let H = C* m C r J m , F 
-l 



C#j ^Cjj + <TjIj J and G = Cj m C TO ^j. Then (18) can be evaluated as 

Hy J F • E[f m ] - FG (E [f m f^|y] ) H T = H^F • /i - FG [A + /x/x T ] H T . 

We have therefore shown how to calculate the predictive distribution in (12). The complex- 
ity of these computations is 0(Nj + m 3 ) which is a significant improvement over 0(N 3 ) 
where N = M x Nj. 

4. Sparse Grouped mixed-effect GP Model (GMT-GP) 

In this section, we extend the mixed-effect GP model such that the fixed-effect functions 
admit a group structure. We call this Grouped mixed-effect GP model (GMT-GP). More 
precisely, each task is sampled from a mixture of shared fixed-effect GPs and then adds its 
individual variation. We show how to perform the inference and model selection efficiently. 



10 



NONPARAMETRIC BAYESIAN MlXED-EFFECT MODEL: A SPARSE GAUSSIAN PROCESS APPROACH 




Figure 1: Plate graph of the GMT-GP. Blue nodes denote observations, green ones are 
(hyper)parameters and the gray ones are latent variables. 



4.1. Generative Model 

First, we specify the sparse GMT-GP model, and show how we can learn the hyper- 
parameters and the inducing variables using this sparse model. 



Assumption 3 For each j and x £ X , 

f(x)=Ux) + f(x), j = l, 



,M 



where {fk}, k = 1, • • • , K and f 3 are zero-mean Gaussian processes with covariance function 
KLk and K,, and Zj £ {1, • • • ,K}. In addition, {fk} and {f 3 } are assumed to be mutually 
independent. 

The generative process (shown in Fig. 1) is as follows, where Dir and Multi denote the 
Dirichlet and the Multinominal distribution respectively. 

1. Draw the processes of the mean effect: fk(-)\9k ~ 07^(0, f£k('i •))> k = 1,2, ■ ■ ■ , K; 

2. Draw ir\ao ~ Dir(o>o); 

3. For the j-th task (time series); 

• Draw zj\tt ~ Multi(7r); 

• Draw the random effect: f j (-)\0 ~ gV(0,IC(-, •)); 



Draw y J \zj, f 3 ,x 3 , cr| ~ N if 3 (x J ) , cj| • Ij J , where f 3 = f z . + f 3 and where to 
simplify the notation Ij stands for I_/v - 



4.2. Variational Model Selection 

In this section we show how to perform the learning via variational approximation. The 
derivation follows the same outline as in the previous section but due to the hidden vari- 
ables Zj that specify group membership, we have to use the variational EM algorithm. As 
mentioned above, for the A:-th mixed-effect (or center), we introduce m,k auxiliary inducing 
support variables and the hidden variable rjk = /^(A^), which is the value of k-th. 
fixed-effect function evaluated at X?i . 
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Let f k = f k (x) € JR N denote the function values of the fc-th mean effect so that 
ffc = fk{ x ^) G M^' is the sub- vector of corresponding to the j-th task. Let P = 
f(xi) € TR Nj be the values of the random effect at xK Denote the collection of the 
hidden variables as J = {f k },T = {P},H = {Vk}, Z = {zj}, and ir. In addition let 
C% = TC k {x*,x j ), Q)j =JC k (x j z x j ),C jk = lC k (x j ,Xl) and C kk = /C fc (A*,A*), and 
similarly C*j = K,(x* , a;- 7 ), Cjj = /C(£C J , jr J ) and Cjj = Cjj + a?Ij where Ij stands for 1^. 

To learn the hyper-parameters we wish to maximize the marginal likelihood Pr(y) where 
y is all the measurements. In the following we develop a variational lower bound for this 
quantity. To this end, we need the complete data likelihood and the variational distribution. 
The complete data likelihood is given by 



where 



Pr(y, ff, H, Z,tt) = Pr(#|ff, J 7 , Z) Pr(ff|H) Pr(Z|7r) Pr(7r) Pt(T) Pr(H) (20) 



K M 

pr(H) = n pr (^)> pr (^) = n pr (f- 
*;=i j=i 

M X 

Pr(7r) = Dir(7r|a ), Pr(Z|7r) = II II ^ 

j=ifc=i 

K M K 

Pr($|H) = n Pr (f*l»?fc). Pr(y|5,-F,Z) = II II [lW'|f j ,ffc) 



k=l j=l k=l 



where, as usual {zj k } represent z. unit vector. 

We approximate the posterior Pr(#, J", H, Z,ir\y) on the hidden variables using 

q(!S,r,H,Z,*)=q(!S,r,H\Z)q(Z)q(ic) (21) 

where 

P, H|Z) = Pr(^|ff, Z, y) Pr(ff|H)*(H) 

MX X 

= n n [ p r(f j if fe ,y j )]^ n pr ^i^)^(^)- 

j=ik=i k=i 

This extends the variation form of the previous section. Our use of f k as the complete set 
of observation when the true group is k makes for convenient notation of simplifying the 
derivation. 
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The variational lower bound, denoted as Fy, is given by: 

Pr(y,^,^,H,Z,7T 



Pr(y) ^ F v = J q($, F, H, Z, ir) x log 
= j g (ir)g(Z)g(ff,^,H|Z) 
x log 



9 (&.F,H,Z,ir) 



d^dFd'RdZd'K 



Pr(y|ff, Z) Pr(£|H) Pr(Z|7r) Pr(7r) Pr(F) Pr(H) 



(ZWir)loe 



q($,F,U\Z)q(Z)q(ir) 
Pr(7r)Pr(Z|7r) 



d$dFd~KdZdir 



q(Z)q(ir) 



dndZ 



+ 



J q(Z)q($,F,II\Z)\og 



Pr(y\d, J 7 , Z) Prg|H) Pr(F) Pr(H) 
'/UV.-/MI Z) 



d$dFdHdZ 



To begin with, we evaluate the second term denoted as Fy2, as follows. The term inside 
the log can be evaluated as 



A _ Pr(y F, Z) PrfflH) PrQF) Pr(H) 
q(3,F,n\Z) 



Uj,k 


Pr(y j \f j ,f k ) 


Zjk n fe pr(f fe fe)n,pr(f j )n fc pr(^) 


Ylj,k 


Pr(f%,^)" 





m K 

nn 

j=l k=l 



Pr(y%,P)Pr(P 



z jk K 



n 



Pr(fJ|f fc ,yJ") 
Thus, we can write Fy 2 as 

F y2 = J q(Z)q% -F,H|Z)(logA) d$dFd*ldZ 

= f q(Z) f 11 <f>{r, k ) | log G(Z, H, y) + £ log 
r fc=i I k =i 



<t>{r)k) _ 



dn 



dz, 



where 



logG(Z,H,y) = J Pr(J-|ff,Z)Pr(ff|H)log 



M K 

nn 

j=ik=i 



Pr(y%,fJ)Pr(fJ) 
Pr(fJ'|f fc ,yJ') 



We show below that log G(Z, H, y) can be decomposed as 

M K 

log G(Z, H, y) = E *i* lo S ^')> 
j=i fe=i 
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where 



logG(i ?fc ,^)=log[W|aJ,C ii ) 



Tr 



(22) 



where a. k = CjkC kk r]j. and Qjf- = CjkC kk Ckj- Consequently, the variational lower bound 



is 



Fy = j q{Z)q{K)\og 



Pr(7r)Pr(Z|7r) 
9(Z)g(ir) 



dTrdZ 



+ j q(Z) J f[Hm) jlogG(Z,H,#) + ^log 



Pr(*7fc) 



dZ 



fe=i l fc=i 
To optimize the parameters we use the variational EM algorithm. 

• In the Variational E-Step, we estimate q*(Z),q*(Tr) and {4>*{r]k)}- 

To get the variational distribution <?*(Z), we take derivative of Fy w.r.t. q(Z) and set 
it to 0. This yields 

logg*(Z)= f g(7r)log(Pr(Z|7r))d7r+ f JJ 4>{r\ k ) log G(Z, H, y)dH 



k=l 



M K 



J212 z i k [^(irJPogTft] +' SE '4>(n k )[ lo sG(Vk,y : ')]] + const 

3=1 k=l 



from which we can derive 



M K 



(z>=nn 



jk ' 



j=l k=l Ljk=l Pjk 

logp jk = IE^) [log vr fc ] +TE (f){Vk) [logG(r)k,y :) )}, 

where fE q ( n ) [log vr^.] = ip(atk) ~ i J iYlk a k) where ip is the digamma function, a& is 
defined below in (23), and IE^^flogG^,?/- 7 )] is given below in (34). 

For the variational distribution of q*(jr) the derivative yields 

log <f (7r) = log Pr(7r) + J q(Z) log(Pr(Z|-7r))d7r + const 

K M K 

= (a - 1) ^2 log(vr fc ) + ^ ^ TE[zjk] log vr fc + const 

fe=l j=l k=l 

and taking the exponential of both sides, we have 

q*(ir) = I)iri7r a ) 
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where 



K 



a k = a + N k , N k = '^2 r jk- 

3=1 



(23) 



The final step is to get the variational distribution of (p*(r) k ), k = 1, • • • , K. Notice 
that only Fy2 is a function of <p{r)k)- We can rewrite this portion as 



r K I I r M K K 
/ II ttm) I / ?(z) E E *i* N y j )) dz \ + £ log 

^ fc=l \ I ^ 1 = 1 fe=l fc=l 



M if 



0(»?fc) 



A 



11^) ( EE E ?(z)M [logG^,^)] +E lo § 

j=l k=l 



k=l 



K 

E / 

k=l J 



M 



E E 9(z) fed logG^fe,^; 



log 



fc=l 



Pr(r? fc ) 
dr} k . 



(24) 



Thus, our task reduces to find 4>*{r) k ) separately. Taking the derivative of (24) w.r.t. 
(p(rfk) and setting it to be zero, we have 



M 

logc/>*(r? fc ) = ^2lE q{z) [z jk ] log G(r) k , y 3 ) + log Pr(j? fc ) + const. 
Using (22) and the fact that second term in (22) is not a function of rj k , we obtain 

M 



3=1 



(25) 



Thus, we have 



J>*(r, k ) oc exp l~(v k f (Cj£*Cj£)ii* + faf U^^MC^]-^ 
where 

M 

* = C kk + E ^'g(Z)[ z jA:]Cfc. ; -[Cj. ; -]~ 1 Cjfc. 

Completing the square yields the Gaussian distribution 



"(i?fe) =M{n k ,T, k ), 



where 



M 



Hk — Cfcfc* 1 ^ E 9 ( Z )[2;jfc]Cfej[Cjj;] S fc — Cfcfc* 1 C fcfc . (26) 



3=1 
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In the Variational M-Step, based on the previous estimated variational distribution, 
we wish to find hyperparameters that maximize the variational lower bound Fy. The 
terms that depend on the hyperparameters and the inducing variables {X^} are given 
in (24). Therefore, using (22) again, we have 



K 

F v (X k ,0)=J2 J <t>*(v 



k=l 
K 



M 



^2r jk logG(ri k ,y 3 
3=1 



+ log 



J2 J f(Vk) log 
fe=i I 



M 



3=1 



+ log 



(m) 

Prfafc) 



dr]k 



dr)k 



K m 



k=l 3=1 



K 



k=l 



£ / <i>*(vk) log 



<t>*(rik) 



dVk 



K m 



k \rt-l 

30 



k=l j=l 



From (25), we know that the term inside the log is constant, and therefore, extracting 
the log from the integral and cancelling the </>* terms we see that the fc'th element of 
first term is equal to the logarithm of 



M 



(27) 



3 = 1 

We next show how this multivariate integral can be evaluated. First consider 



(2tt)— r|C 



30 



exp 



{-f(v i -°'S) T ic ii r 1 (v i 



«*)} 



(2vr)- — |C 



33 \ 



-1 



exp J (yi-ajf r^C,, - aj) 



(2tt)- t |Ci 



-•(2^)--^^, 



(2 7 r)-*|r- 1 C ii |-^ 



exp <{ -^(v 3 ~ a 



k\T 
3 I 



r jk C 00 



-1 



{y ] - ex*) 



where A jk = (r jk ) 2 (2tt) 

M 



2. , , N 3 i - 1 ~ r jk) p ^'jfc 

2 Cj 7 - 2 . Thus, we have 



n[^'i«r e «) 



A/ 

3=1 



M 



n^'i«i.^ e ii)' 
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The first part is not a function of r) k , for the integration we are only interested in the 
second part. Since y is the concatenation of all y^'s, we can write 



M 



n^'l«i. rjfiii) = m\*kC^ Vk , C k ), 



(28) 



where 



&2k 
\C-MkJ 



M 

and C fc = 0r7 fc iC 

3=1 



V 



11 



r Mk^MM/ 



Therefore, the integral can the written as the following marginal distribution of 
Pv(y\k), 

M 

IJ^Ia^rJ^C^Pr^fc)^ = / M(y\A k C^r, k ,C k )Pv( Vk )dr, k = Pv(y\k)- 

(29) 

Using the fact that Pr(r} k ) = J\f(0, C kk ) and observing that (28) is a conditional 
Gaussian, we have 

Pr(y|fc)=AA(0,A fc C- fc 1 A^ + C fc ). 
Using this form and the portion of Aj k that depends on the parameters we get 



K 



K M 1 KM 

F v (X,6) = J2logPT(y\k) + Y,E ' J ' ' * ' 



k=i 

K 



, lo § iCiii - 2 E E ^ [(<& - Q'i,)c :l ; 

k=l j=l k=l 3=1 

K M 



M 



E lo s ^(m + ^ E ^ le* i - 1 E E ^ ^ - Q" 



2 ^ 01 JJ ' 2 

fe=l j=l k=l j=l 



(30) 



This extends the bound for the single center K = 1 case given in (10). Furthermore, 
following the same line as the previous derivation, the direct inference for the full 
model can be obtained where r/ k is substituted with f k and the variational lower 
bound becomes 



A 



K-l 



M 



F v (X,0) = ]TlogA/-(y|0,C fcfc + C fc ) + — — £log \C 3j \. 



k=l 



3=1 



We have explicitly written the parameters that can be chosen to further optimize 
the lower bound (30), namely the support inputs {Af^}, and the hyper-parameters 6 
which are composed of {0 k } and {6} in K k and K respectively. 

By calculating derivatives of (30) we can optimize the lower bound using a gradient 
based method. It is easy to see that the complexity for calculating the derivative of 
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the second and third terms of (30) is 0{N). Thus, the key computational issue of 
deriving a gradient descent algorithm involves computing the derivative of logPr(y|&). 
We first show how to calculate the inverse of the N x N matrix T = A^C^A^ + C k . 
Using the matrix inversion lemma (the Woodbury identity), we have 



(A k c k U T k + c k y 



[c 



[C^A,, ( C kk + A'^CT'Afc ) A k [CT L - 



-1 



. TrrSfci-1 



Since C k is a block-diagonal matrix, its inverse can be calculated in ^ - O(Nj). Now, 

Cjtfc+A^[C fc ] _1 Afc is an m k x m k matrix where m k is the number of inducing variables 
for the fc-th mean effect. Therefore the computation of (30) can be done in 0{m\ + 
Nj + Nm\). Next, consider calculating the derivative of the first term. We have 

»iM*l*)_Wi«T->,-W'£), 



89 j 2* 



BO, 



d9i 



where, by the chain rule, we have 



dr _ d\ k j T x gCfcfc x T tdAToc 

~d¥j ~ ~~89~ kk k ~ kk 89 j kk ^ k ^kk~^~ " 



89, 



89, ■ 



Therefore, pre-calculating y T Y~ 1 and sequencing the other matrix operations from 
left to right the gradient calculation for each hyperparameter can be calculated in 
0(Nml). In our implementation, we use stochastic coordinate descent, where at each 
iteration, one coordinate (parameter) is chosen at random and we perform gradient 
descent on that coordinate. 

4.3. Evaluating log G(Z, H, y) 

In this section, we develop the expression for logG(Z, H, y). 



logG(Z,H,y) 

. M K 

l=l;p=l 
M K r , / K 



K 



v=l 



M K 



3=1 fc=l 



Pr(^|f fc ,fJ)Pr(f 
Pr(fJ'|f fc ,l^) 



j=l k=l 
M K 

Y.Y. z i k 

3=1 k=l 
M K 

3=1 k=l 
m K 

3=1 k=l 



K 



n Pr ^l^)) xP^lf fc ,y) xlog 
\v=l 

Pr(f j |W) x 
Pr(P|f fc ,^')Pr(f fc |7 ?fc ) xlog 



Pr(yJ"|fc,fJ)Pr(fJ) 



Pr(ffcl^) II V*(U\Vv)d$- k 
Pr(yi|f fe ,fi)Pr(fi) 



x loo 



d$df j 

Pr(^|f fc ,P)Pr(P; 
Pr(f%,y*) 



PT(P\f k ,y3 



df k dP 



(31) 
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where the second line holds because in the sum indexed by j and k all the product measures 

M K 

n n[^%>y i )} zip > 

i=i,ift P =i 

are integrated to 1, leaving only the Pi(P\f k ,y J ). Our next step is to evaluate log G(rf k , Vj), 
we have 



Pr(F|f fe ,^)Pr(f fe |» ?fe ) xlog 



Pr(^'|f fc ,fJ)Pr(P) 



PT(F\f k ,yi)PT(fk\rik)xlog 



Pr(f%,^) 
Pr(y>'|f fc ,P)Pr(F 



Pr(l^'lffc) 



Pr(y%,ff)Pr(f%) 
Pr(f fc |»? fe )log [Pr(^'|f fc )] rff fc = y Pr(F|77 fe )log [Pr(^|F)j dP 



df k dP 
(32) 



where the last line holds because of the independence between P and f k . We next show how 
this expectation can be evaluated. This is more complex than the single-task case because 
of the coupling of the fixed-effect and the random effect. 

Recall that Pv{P\ m ) = jV(P\C jk C-^ k , C^.-C^C^C^). Denote C" 1 = L T L where 
L can be chosen as its Cholesky factor, we have 



log [Pr(y j \P)] = --(Lrf - LP) T (Lyi - LP) + log 









(2vr)-^ 


+ log 





Notice that Pr(Lf% fe ) = J^{LC jk C^ri k ,L(C^ - Q^)L T ) where Q)- = C jk C~^C kj . 
Thus, 



logGfafc.y*) = E [fJ -|, fc] log [Pv(y j \P] 



Lyi - LC jk C kk r) k 
log Afiy^aj^jj 



Tr{L{C%-Q%)L T ) + \og 



(27T)" 



+ log 



TV 



where a. k - = CjkC^r/k- Finally, we have 

m K 

log [jS/VlajA;) 
Furthermore, marginalization out rf k , we have 



m K 

logG(H,y) = ^5> fe 

j=l k=l 



-Tr 



33 



(33) 



M r(r)k) logG(rj k ,y j ) = log \jV(y j \^,C j:j ) 



c ifc C fcfc( S fc ~ c fcfc)C fcfc 1 C i feC - 1 (34) 



19 



Wang Khardon 



4.4. Prediction Using the Sparse Model 

The proposed sparse model can be used for two types of problems. Prediction for existing 
tasks and prediction for a newly added task. We start with deriving the predictive distribu- 
tion for existing tasks. Given any task j, our goal is to calculate the predictive distribution 
Pr(/ J (x*)\y) at new input point x*, which can be written as 

K K 

^Pr(.f = l,y)Pr(z jk = l\y) = £ r jk Prtf (x*)\z jk = l,y). (35) 

k=l k=l 

That is, because Zjk form a partition we can focus on calculating Pr(/ J (a;*)| Zjk = l,y) 
and then combine the results using the partial labels. Calculating (35) is exactly the same 
as the predictive distribution in the non-grouped case, the derivation in Section 3.2 gives 
the details. The complexity of these computations is 0(K(Nj + m 3 )) which is a significant 
improvement over 0(KN 3 ) where N = YljNj- Instead of calculating the full Bayesian 
prediction, one can use Maximum A Posteriori (MAP) by assigning the j-th task to the 
center c such that c = argmax Pr(zjk = 1 j j/) . Preliminary experiments (not shown here) 
show that the full Bayesian approach gives better performance. Our experiment below is 
the results of Bayesian prediction. 

Our model is also useful for making prediction for newly added tasks. Suppose we are 
given {x M+1 ,y M+1 } and we are interested in predicting f M+l (x*). We use the variational 
procedure to estimate its partial labels w.r.t. different centers Pr(zjv/+i,fc = and then 
(35) can be applied for making the prediction. In the variational procedure we update the 
parameters for Zm+i but keep all other parameters fixed. Since each task has small number 
of samples, we expect this step to be computationally cheap. 

5. Related Work 

Our work is related to (Titsias, 2009) particularly in terms of the form of the variational 
distribution of the inducing variables. However, our model is much more complex than 
the basic GP regression model. With the mixture model and an additional random effect 
per task, we must take into account the coupling of the random effect and group specific 
fixed-effect functions. The technical difficulty that the coupling introduces is addressed in 
our paper, yielding a generalization that is consistent with the single-task solution. 

The other related thread comes from the area of GP for multi-task learning. Bonilla et al. 
(2008)proposed a model that learns a shared covariance matrix on features and a covariance 
matrix for tasks that explicitly models the dependency between tasks. They also presented 
techniques to speed up the inference by using Nystrom approximation of the kernel matrix 
and incomplete Cholesky decomposition of the task correlation matrix. Their model, which 
is known as the linear coregionalization model (LCM) is subsumed by the framework of 
convolved multiple output Gaussian process (Alvarez and Lawrence, 2011). The work of 
Alvarez and Lawrence (2011) also derives sparse solutions which are extensions of different 
single task sparse GP (Snelson, 2006; Quihonero-Candela and Rasmussen, 2005). Our work 
differs from the above models in that we allow a random effect for each individual task. As 
we show in the experimental section, this is important in modeling various applications. If 
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the random effect is replaced with independent white noise, then our model is similar to 
LCM. To see this, from (35), we recognize that the posterior GP is a convex combination of 
K independent GPs (mean effect). However, our model is capable of prediction for newly 
added tasks while the models in (Bonilla et al., 2008) and (Alvarez and Lawrence, 2011) 
cannot. Further, the proposed model can naturally handle heterotopic inputs, where dif- 
ferent tasks do not necessarily share the same inputs. In (Bonilla et al., 2008), each task 
is required to have same number of samples so that one can use the property of Kronecker 
product to derive the EM algorithm. 

6. Experimental Evaluation 

Our implementation of the algorithm makes use of the gpml package (Rasmussen and Nickisch, 
2010) and extends it to implement the required functions. For performance criteria we use 
the standardized mean square error (SMSE) and the mean standardized log loss (MSLL) 
that are defined in (Rasmussen and Williams, 2005). We compare the following methods. 
The first four methods use the same variational inference as described in Section 4. They 
differ in the form of the variational lower bound they choose to optimize. 

1. Direct Inference: use full samples as the support variables and optimize the marginal 
likelihood. When K = 1, the marginal likelihood is described in Section 3 and the 
predictive distribution is (5). 

2. Variational Sparse GP for MTL (MT-VAR): the proposed approach. 

3. MTL Subset of Datapoints (MT-SD): a subset Af* of size mk is chosen uniformly 
from the input points from all tasks x for each center. The hyper-parameters are 
selected using A^ (the inducing variables are fixed in advance) and their corresponding 
observations by maximizing the variational lower bound. We call this MT-SD as a 
multi-task version of SD (see Rasmussen and Williams, 2005, chap. 8.3.2), because 
in the single center case, this method uses (4) and (5) using the subset X m ,y m and 
x^yi as the full sample (thus discarding other samples). 

4. MTL Projected Process Approximation (MT-PP): the variational lower bound 
of MT-PP is given by the first two terms of (30) ignoring the trace term, and therefore 
the optimization chooses different pseudo inputs and hyper-parameters. We call it 
MT-PP because in the single center case, it corresponds to a multi-task version of 
PP (see Rasmussen and Williams, 2005, chap. 8.3.3). 

5. Convolved Multiple Output GP (MGP-FITC, MGP-PITC): the approaches 
proposed in (Alvarez and Lawrence, 2011). For all experiments, we use code from 
(Alvarez and Lawrence, 2011) with the following setting. The kernel type is set to 
be gg. The hyperparameters, parameters and the position of inducing variables are 
obtained via optimizing the marginal likelihood using a scaled conjugated gradient 
algorithm. The support variables are initialized as equally spaced points over the 
range of the inputs. We set the R q = 1, which means that the latent functions share 
the same covariance function. Whenever possible, we set Q which, roughly speaking, 
corresponds to the number of centers in our approach, to agree with the number of 
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centers. The maximum number of iterations allowed in the optimization procedure is 
set to be 200. The number of support variables is controlled in the experiments as in 
our methods. 

Three datasets are used to demonstrate the empirical performance of the proposed 
approach. The first synthetic dataset contains data sampled according to our model. The 
second dataset is also synthetic but it is generated from differential equations describing 
glucose concentration in biological experiments, a problem that has been previously used to 
evaluate multi-task GP (Pillonetto et al., 2010). Finally, we apply the proposed method on 
a real astrophysics dataset. For all experiments, the kernels for different centers are assumed 
to be the same. The hyperparameter for the Dirichlet distribution is set to be ao = 1/K. 
Unless otherwise specified, the inducing variables are initialized to be equally spaced points 
over the range of the inputs. To initialize, tasks are randomly assigned into groups. We run 
the conjugate gradient algorithm (minimize .m) on a small subset of tasks (100 tasks each 
having 5 samples) to get the starting values of hyperparameters of the K, and /C, and then 
follow with the full optimization as above. Finally, we repeat the entire procedure 5 times 
and choose the one that achieves best variational lower bound. The maximum number of 
iterations for the stochastic coordinate descent is set to be 50 and the maximum number of 
iterations for the variational inference is set to be 30. The entire experiment is repeated 10 
times to obtain the average performance and error bars. 

6.1. Synthetic data 

In the first experiment, we demonstrate the performance of our algorithm on a regression 
task with artificial data. More precisely, we generated 1000 single-center tasks where each 
f 3 (x) = f(x) + f 3 (x) is generated on the interval x € [—10, 10]. Each task has 5 samples. 
The fixed-effect function is sampled from a GP with covariance function 

Cov[/(ti),/(* 2 )] = e -^-*>) a / 2 . 

The individual effect f 3 is sampled via a GP with the covariance function 

Cov[f (h),f(t 2 )] = 0.25e-^~^ 2 / 2 . 

The noise level a 2 is set to be 0.1. The sample points x 3 for each task are sampled uni- 
formly in the interval [—10, 10] and the 100 test samples are chosen equally spaced in the 
same interval. The fixed-effect curve is generated by drawing a single realization from the 
distribution of f while the {i 3 } are sampled i.i.d. from their common prior. We set the 
number of latent functions Q = 1 for MGP. 

The results are shown in Fig. (2). The top row shows qualitative results for one run using 
20 support variables. We restrict the initial support variables to be in [—7, 7] on purpose to 
show that the proposed method is capable of finding the optimal inducing variables. It is 
clear that the predictive distribution of the proposed method is much closer to the results of 
direct inference. The bottom row gives quantitative results for SMSE and MSLL showing 
the same, as well as showing that with 40 pseudo inputs the proposed method recovers the 
performance of full inference. The MGP performs poorly on this dataset, indicating that it 
is not sufficient to capture the random effect. We also see a large computational advantage 
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over MGP in this experiment. When the number of inducing variables is 20, the training 
time for FITC (the time for constructing the sparse model plus the time for optimization) 
is 1515.19 sec. while the proposed approach is about 7 times faster (201.81 sec.) 1 . 

6.2. Simulated Glucose Data 

We evaluate our method to reconstruct the glucose profiles in an intravenous glucose tol- 
erance test (IVGTT) (Vicini and Cobelli, 2001; Denti et al., 2010; Pillonetto et al., 2010) 
where Pillonetto et al. (2010) developed an online multi-task GP solution for the case where 
sample points are frequently shared among tasks. This provides a more realistic test of our 
algorithm because data is not generated explicitly by our model. More precisely, we apply 
the algorithm to reconstruct the glucose profiles in an intravenous glucose tolerance test 
(IVGTT) where blood samples are taken at irregular intervals of time, following a single 
intravenous injection of glucose. We generate the data using minimal models of glucose 
which is commonly used to analyze glucose and insulin IVGTT data (Vicini and Cobelli, 
2001), as follows (Denti et al., 2010) 



where D denotes the glucose dose, G(t) is plasma glucose concentration and I(t) is the 
plasma insulin concentration which is assumed to be known. G b and I b are the glucose 
and insulin base values. X(t) is the insulin action and 5{t) is the Dirac delta function. 
Sg, Si,p2, V are four parameters of this model. 

We generate 1000 synthetic subjects (tasks) following the setup in previous work: 1) 
the four parameters are sampled from a multivariate Gaussian with the results from the 
normal group in Table 1. of (Vicini and Cobelli, 2001), i.e. 



2) I(t) is obtained via spline interpolation using the real data in (Vicini and Cobelli, 2001); 

3) Gb is fixed to be 84 and D is set to be 300; 4) 5(t) is simulated using a Gaussian profile with 
its support on the positive axis and the standard deviation (SD) randomly drawn from a 
uniform distribution on the interval [0, 1]; 5) Noise is added to the observations with a 2 = 1. 
Each task has 5 measurements chosen uniformly from the interval [1, 240] and an additional 
10 measurements are used for testing. Notice that the approach in (Pillonetto et al., 2010) 
cannot deal the situation efficiently since the inputs do not share samples often. 

The experiments were done under both the single center and the multi center setting and 
the results are shown in Fig. 3. The plots of task distribution on the left suggest that one 
can get more accurate estimation by using multiple centers. For the multiple center case, the 
number of centers for the proposed method is arbitrarily set to be 3 (K = 3) and the number 
of latent function of MGP is set to be 2 (Q = 2) (We were not able of obtain reasonable 

1. The experiment was performed using matlab R2012a on an Intel Core Quo 6600 powered Windows 7 
PC with 4GB memory. 



G(t) 
X(t) 
G(0) 



-[S G + X(t)]G(t) +S G -G b + 5(t) ■ D/V 

-p 2 -X(t)+ P2 -S I -[I(t)-I b ] 

G b , X(0) = 



(36) 



/x= [2.67,6.42,4.82,1.64] 

£ = diag(1.02, 6.90, 2.34, 0.22); 
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results using MGP when (5 = 3). First, we observe that the multi-center version performs 
better than the single center one, indicating that the group-based generalization of the 
traditional mixed-effect model is beneficial. Second, we can see that all the methods achieve 
reasonably good performance, but that the proposed method significantly outperforms the 
other methods. 

6.3. Real Astrophysics Data 

We evaluate our method using the astronomy dataset of (Wang et al., 2010), where a gen- 
erative model was developed to capture and classify different types of stars. The dataset, 
extracted from the OGLEII survey (Soszynski et al., 2003), includes stars of 3 types (RRL, 
CEPH, EB) which constitute 3 datasets in our context. One example of each class is shown 
in Fig. 4. These examples are densely sampled but some stars have less samples and we sim- 
ulate the sparse case by sub-sampling in our experiments. In previous work (Wang et al., 
2010), we developed a grouped mixed-effect multi-task model that in addition allowed for 
phase shift of the light measurements. As shown in (Wang et al., 2010), stars of the same 
type have a range of different shapes and the group structure is useful in modeling this 
domain. However, for inference, Wang et al. (2010) used a simple approach clipping sample 
points to a fine grid of 200 equally spaced points, due to the high dimensionality of the full 
sample (over 18000 points). 




Figure 4: OGLEII: time series for one star (one task in our context) of each type. 



Here we use a random subset of 700 stars (tasks) for each type and preprocess the 
data normalizing each star to have mean and standard deviation 1, and using universal 
phasing (Protopapas et al., 2006) to phase each time series to align the maximum of a 
sliding window of 5% of the original points. For each time series, we randomly sample 10 
examples for training and 10 examples for testing per evaluation of SMSE and MSLL. The 
number of centers is set to be 3 for the proposed approach and for MGP we set Q = 1 
(We were not able to use Q > 1). The results are shown in Fig. 5. We can see that the 
proposed model significantly outperforms all other methods on EB. For Cepheid and RRL 
whose shape is simpler, we see that the error of the proposed model and MGP are very 
close and both outperform other methods. 
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7. Conclusion 

The paper develops an efficient variational learning algorithm for the grouped mixed-effect 
GP for multi-task learning, which compresses the information of all tasks into an optimal 
set of support variables for each mean effect. Experimental evaluation demonstrates the 
effectiveness of the proposed method. In future, it will be interesting to derive an online 
sparse learning algorithm for this model. Another important direction is to investigate 
efficient methods for selection of inducing variables when the input is in high dimensional 
space. In this case, the clipping method of (Wang et al., 2010) is clearly not feasible, but 
the variational procedure can provide appropriate guidance. 
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Figure 2: Synthetic Data: Comparison between the proposed method and other approaches. 

Left Column: Predictive distribution for the fixed-effect. The solid line denotes 
the predictive mean and the corresponding dotted line is the predictive variance. 
The black crosses are the initial value of the inducing variables and the red ones 
are their values after learning process. Right Column: The average SMSE and 
MSLL for all the tasks. 
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Figure 3: Simulated Glucose Data. Left Column: Single center K = 1 results; Right Col- 
umn: Multiple center K = 3 results; Top: 15 tasks (Blue) with observations (Red 
Diamonds) and estimated fixed-effect curve (Green) obtained from 1000 IVGTT 
responses. Although the data is not generated by our model, it can be seen that 
different tasks have a common shape and might be modeled using a fixed effect 
function plus individual variations. Middle: The average SMSE for all tasks; 
Bottom: The average MSLL for all tasks. 
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Figure 5: OGLEII: The average SMSE and MSLL for all the tasks are shown in the Left 
and Right Column. Top: Cepheid; Middle: EB; Bottom: RRL. 
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